1) Introduction

The automotive trends dataset tracks vehicle attributes such as weight and horsepower alongside fuel economy, CO2 emissions, production share, and other factors.

2) Data

i) Data source:

The data for this analysis come from the Automotive Trends dataset provided by the United States Environmental Protection Agency (EPA).

Below is the link for the data source:

https://www.epa.gov/automotive-trends/explore-automotive-trends-data#SummaryData

ii) Collection of Data :

The EPA collected the data via laboratory testing or directly from manufacturers, using official EPA test procedures. The EPA has maintained the dataset since 1975 to provide the public with information about fuel economy, technology data, emissions, and auto manufacturers’ performance in meeting the agency’s greenhouse gas emissions standards, as well as to support national programs.

iii) Cases :

Each case (row) in this dataset describes the performance of one vehicle type in a given model year.

iv) Variables :

The variables used in this analysis are listed below.

Model year: The model year of the vehicles, from 1975 to 2022. The most recent rows are labelled "Prelim. 2022".

Regulatory class : Indicates whether the row is classified as a car or a truck. The value "All" marks rows that combine both classes.

Vehicle type: The type of vehicle within the regulatory class: sedan/wagon or car SUV for cars, and pickup, minivan/van, or truck SUV for trucks. The values All, All Car, and All Truck are totals.

Production Share: The fraction of a model year's production accounted for by the vehicle type (1 for the "All" rows).

Real.World.MPG : The overall real-world fuel economy in miles per gallon.

Real.World.MPG_City : The real-world miles per gallon for city driving.

Real.World.MPG_Hwy : The real-world miles per gallon for highway driving.

Real.World.CO2..g.mi. : The overall real-world CO2 emissions in grams per mile.

Real.World.CO2_City..g.mi. : The real-world CO2 emissions in grams per mile for city driving.

Real.World.CO2_Hwy..g.mi. : The real-world CO2 emissions in grams per mile for highway driving.

Weight : The vehicle weight, measured in lbs.

Horsepower : The engine power, measured in hp.

v) Type of study

The dataset I have taken is from an observational study.

vi) Data Quality

Step 1: Load the necessary packages

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(readxl)

Step 2: Read the data

rawdata <- read.csv("~/Desktop/201/ Automotive_Trends Report_Project.csv")

Step 3 : Examine the first six rows of the data

head(rawdata)
##   Model.Year Regulatory.Class Vehicle.Type Production.Share Real.World.MPG
## 1       1975              All          All                1       13.05970
## 2       1975              Car      All Car         0.806646       13.45483
## 3       1975              Car  Sedan/Wagon         0.805645       13.45833
## 4       1975            Truck    All Truck         0.193354       11.63431
## 5       1975            Truck       Pickup         0.131322       11.91476
## 6       1975            Truck  Minivan/Van           0.0447       11.10606
##   Real.World.MPG_City Real.World.MPG_Hwy Real.World.CO2..g.mi.
## 1            12.01552           14.61167              680.5961
## 2            12.31413           15.17266              660.6374
## 3            12.31742           15.17643              660.4660
## 4            10.91165           12.65900              763.8613
## 5            11.07827           13.12613              745.8814
## 6            10.55642           11.86084              800.1940
##   Real.World.CO2_City..g.mi. Real.World.CO2_Hwy..g.mi. Weight..lbs.
## 1                   739.7380                  608.3116     4060.399
## 2                   721.8293                  585.8472     4057.494
## 3                   721.6367                  585.7019     4057.565
## 4                   814.4506                  702.0300     4072.518
## 5                   802.2009                  677.0464     4011.977
## 6                   841.8573                  749.2722     4195.690
##   Horsepower..HP. Footprint..sq..ft..
## 1        137.3346                   -
## 2        136.1964                   -
## 3        136.2256                   -
## 4        142.0826                   -
## 5        140.9365                   -
## 6        143.2245                   -

Output: The first six rows are shown above. Note that Footprint..sq..ft.. contains "-" instead of a number in these 1975 rows.

Step 4: Check the dimensions to confirm that the data loaded correctly.

dim(rawdata)
## [1] 384  13

Output : According to the above observation, there are 384 rows and 13 columns.

Step 5 : Examine the last six rows of the data

tail(rawdata)
##       Model.Year Regulatory.Class Vehicle.Type Production.Share Real.World.MPG
## 379 Prelim. 2022              Car      Car SUV                -       32.38793
## 380 Prelim. 2022              All          All                -       26.35965
## 381 Prelim. 2022            Truck  Minivan/Van                -       25.59317
## 382 Prelim. 2022            Truck    Truck SUV                -       24.75038
## 383 Prelim. 2022            Truck    All Truck                -       23.40912
## 384 Prelim. 2022            Truck       Pickup                -       20.06288
##     Real.World.MPG_City Real.World.MPG_Hwy Real.World.CO2..g.mi.
## 379            29.25306           35.23655              261.7094
## 380            23.17949           29.40284              330.8116
## 381            22.10621           29.04996              344.2938
## 382            21.90441           27.43990              354.1329
## 383            20.60126           26.09186              375.9269
## 384            17.49366           22.56268              442.4302
##     Real.World.CO2_City..g.mi. Real.World.CO2_Hwy..g.mi. Weight..lbs.
## 379                   291.7420                  239.0533     3832.524
## 380                   377.1848                  295.8283     4328.963
## 381                   398.0267                  303.7584     4557.279
## 382                   400.5455                  319.1199     4534.261
## 383                   427.5858                  336.9561     4713.739
## 384                   508.0322                  392.9410     5239.220
##     Horsepower..HP. Footprint..sq..ft..
## 379        269.6559            47.34132
## 380        272.3535            51.67437
## 381        245.0592            56.21571
## 382        268.1756            50.02365
## 383        284.8583            54.37582
## 384        339.0876            65.91698

Output : The last six rows are shown above. The model year is labelled "Prelim. 2022" and Production.Share contains "-" in these rows.

Step 6 : Determine whether it’s a data frame or a tibble.

is.data.frame() returns TRUE for both plain data frames and tibbles, because a tibble is a special kind of data frame. To tell the two apart we use is_tibble(), and a plain data frame can be converted to a tibble with as_tibble.

is.data.frame(rawdata)
## [1] TRUE

Output: The automotive data is in data frame format.

Step 7 : Convert rawdata to a tibble with as_tibble and confirm the result with is_tibble.

newdata <- as_tibble(rawdata)
is_tibble(newdata)
## [1] TRUE

Output: Because is_tibble returned TRUE, newdata is a tibble.

Step 8 : Check for summary of rawdata

summary(rawdata)
##   Model.Year        Regulatory.Class   Vehicle.Type       Production.Share  
##  Length:384         Length:384         Length:384         Length:384        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Real.World.MPG  Real.World.MPG_City Real.World.MPG_Hwy Real.World.CO2..g.mi.
##  Min.   :10.53   Min.   : 9.393      Min.   :10.81      Min.   :254.0        
##  1st Qu.:17.03   1st Qu.:15.001      1st Qu.:19.33      1st Qu.:386.2        
##  Median :19.38   Median :16.898      Median :22.54      Median :458.9        
##  Mean   :20.00   Mean   :17.491      Mean   :22.91      Mean   :466.6        
##  3rd Qu.:23.02   3rd Qu.:19.755      3rd Qu.:27.02      3rd Qu.:522.6        
##  Max.   :33.71   Max.   :29.253      Max.   :38.19      Max.   :844.0        
##  Real.World.CO2_City..g.mi. Real.World.CO2_Hwy..g.mi.  Weight..lbs. 
##  Min.   :291.7              Min.   :222.7             Min.   :2630  
##  1st Qu.:449.7              1st Qu.:328.6             1st Qu.:3536  
##  Median :526.5              Median :395.2             Median :3991  
##  Mean   :528.6              Mean   :410.4             Mean   :3987  
##  3rd Qu.:592.6              3rd Qu.:459.9             3rd Qu.:4415  
##  Max.   :946.2              Max.   :822.0             Max.   :5485  
##  Horsepower..HP.  Footprint..sq..ft..
##  Min.   : 87.81   Length:384         
##  1st Qu.:138.15   Class :character   
##  Median :178.48   Mode  :character   
##  Mean   :183.10                      
##  3rd Qu.:215.07                      
##  Max.   :345.67

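Output : The dataset has 384 observations and 13 variables. Eight variables are numeric and five (Model.Year, Regulatory.Class, Vehicle.Type, Production.Share, and Footprint..sq..ft..) are read as character.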

Step 9 : Check the quality of the data by counting the missing (NA) values in each column with colSums

colSums(is.na(rawdata))
##                 Model.Year           Regulatory.Class 
##                          0                          0 
##               Vehicle.Type           Production.Share 
##                          0                          0 
##             Real.World.MPG        Real.World.MPG_City 
##                          0                          0 
##         Real.World.MPG_Hwy      Real.World.CO2..g.mi. 
##                          0                          0 
## Real.World.CO2_City..g.mi.  Real.World.CO2_Hwy..g.mi. 
##                          0                          0 
##               Weight..lbs.            Horsepower..HP. 
##                          0                          0 
##        Footprint..sq..ft.. 
##                          0

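Output: Every column returns 0, so there are no NA values. However, is.na() does not count the "-" placeholders visible in Production.Share and Footprint..sq..ft.., so those columns still contain cells without a usable value.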
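
The is.na() check only counts true NA values. A minimal sketch of how placeholder strings such as "-" could be counted as well is given below. It uses a small made-up data frame; toy and count_placeholder are illustrative names and are not part of the original analysis.

```r
# Hypothetical toy data mimicking the "-" placeholders seen in the EPA file
toy <- data.frame(
  Production.Share    = c("1", "0.806646", "-"),
  Footprint..sq..ft.. = c("-", "-", "47.34132"),
  Real.World.MPG      = c(13.06, 13.45, 32.39),
  stringsAsFactors    = FALSE
)

# Hypothetical helper: count cells equal to a placeholder string in every column
count_placeholder <- function(df, placeholder = "-") {
  sapply(df, function(col) sum(as.character(col) == placeholder, na.rm = TRUE))
}

na_counts          <- colSums(is.na(toy))     # is.na() finds no missing values
placeholder_counts <- count_placeholder(toy)  # but "-" appears in two columns

print(na_counts)
print(placeholder_counts)
```

Running count_placeholder(rawdata) on the real data would be one way to see how many cells in each column hold "-" rather than a number.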

Step 10 : Check for any duplicate data.

sum(duplicated(rawdata))
## [1] 0

Output : The result is 0, so there are no duplicated rows.

Step 11 : Convert Production.Share and Footprint..sq..ft.. to factors.

rawdata$Production.Share <- as.factor(rawdata$Production.Share)
rawdata$Footprint..sq..ft.. <- as.factor(rawdata$Footprint..sq..ft..)
rawdata %>% glimpse()
## Rows: 384
## Columns: 13
## $ Model.Year                 <chr> "1975", "1975", "1975", "1975", "1975", "19…
## $ Regulatory.Class           <chr> "All", "Car", "Car", "Truck", "Truck", "Tru…
## $ Vehicle.Type               <chr> "All", "All Car", "Sedan/Wagon", "All Truck…
## $ Production.Share           <fct> 1, 0.806646, 0.805645, 0.193354, 0.131322, …
## $ Real.World.MPG             <dbl> 13.05970, 13.45483, 13.45833, 11.63431, 11.…
## $ Real.World.MPG_City        <dbl> 12.01552, 12.31413, 12.31742, 10.91165, 11.…
## $ Real.World.MPG_Hwy         <dbl> 14.61167, 15.17266, 15.17643, 12.65900, 13.…
## $ Real.World.CO2..g.mi.      <dbl> 680.5961, 660.6374, 660.4660, 763.8613, 745…
## $ Real.World.CO2_City..g.mi. <dbl> 739.7380, 721.8293, 721.6367, 814.4506, 802…
## $ Real.World.CO2_Hwy..g.mi.  <dbl> 608.3116, 585.8472, 585.7019, 702.0300, 677…
## $ Weight..lbs.               <dbl> 4060.399, 4057.494, 4057.565, 4072.518, 401…
## $ Horsepower..HP.            <dbl> 137.3346, 136.1964, 136.2256, 142.0826, 140…
## $ Footprint..sq..ft..        <fct> -, -, -, -, -, -, -, -, -, -, -, -, -, -, -…

Output : After the conversion the data contain three character variables, two factors, and eight numeric variables. Production.Share and Footprint..sq..ft.. were read as character variables because some of their cells contain "-", and they are converted to factors here for further analysis.
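
Converting to a factor treats every distinct share or footprint value as its own category. If these columns are later needed as numbers, one alternative is to turn "-" into NA and convert to numeric. This is an assumption about a possible follow-up analysis, not what the step above does, and to_numeric_with_na is a hypothetical helper name.

```r
# Hypothetical helper: convert a character (or factor) column to numeric,
# treating a placeholder string as a missing value
to_numeric_with_na <- function(x, placeholder = "-") {
  x <- as.character(x)              # also handles factors safely
  x[x %in% placeholder] <- NA       # mark placeholders as missing
  as.numeric(x)
}

# Toy values in the same format as Production.Share (made-up selection)
share_chr <- c("1", "0.806646", "-", "0.193354")
share_num <- to_numeric_with_na(share_chr)

print(share_num)
print(mean(share_num, na.rm = TRUE))  # NA must be excluded explicitly
```

With this approach colSums(is.na(...)) would report the placeholder cells as missing, at the cost of having to pass na.rm = TRUE to summary functions.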

Step 12 : Hypothesis test 1

A one-sample t-test is used to check whether the mean of Real.World.MPG_City differs significantly from 17.491. The significance level is set at 0.05. The null hypothesis is that the mean city MPG equals 17.491, while the alternative hypothesis is that the mean city MPG does not equal 17.491. Note that 17.491 is the rounded sample mean of this column, so the test statistic is expected to be close to zero.

t.test(rawdata$Real.World.MPG_City, mu = 17.491)
## 
##  One Sample t-test
## 
## data:  rawdata$Real.World.MPG_City
## t = 0.0003001, df = 383, p-value = 0.9998
## alternative hypothesis: true mean is not equal to 17.491
## 95 percent confidence interval:
##  17.13805 17.84406
## sample estimates:
## mean of x 
##  17.49105

Output: We reject the null hypothesis only if the p-value is less than 0.05. Here the p-value (0.9998) is greater than 0.05, so we cannot reject the null hypothesis that the mean city MPG equals 17.491.
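
To make the test statistic less of a black box, the sketch below computes a one-sample t statistic by hand and compares it with t.test(). The sample values and the hypothesized mean of 15 are made up for illustration and are not taken from the EPA data.

```r
# Toy sample (made-up values, not the EPA data)
x  <- c(12.0, 12.3, 10.9, 11.1, 10.6, 29.3, 23.2, 22.1)
mu <- 15   # hypothesized mean, chosen only for illustration

n        <- length(x)
t_manual <- (mean(x) - mu) / (sd(x) / sqrt(n))   # t statistic by hand
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)   # two-sided p-value

res <- t.test(x, mu = mu)                        # built-in test
print(c(manual = t_manual, builtin = unname(res$statistic)))
print(c(manual = p_manual, builtin = res$p.value))
```

The formula shows why testing against the sample's own mean gives a statistic near zero: the numerator mean(x) - mu almost vanishes. A hypothesized value normally comes from outside the sample, for example a regulatory target.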

Step 13 : Test the relationship between vehicle weight and horsepower

We need to determine whether there is a relationship between vehicle weight and horsepower. We use a correlation test to compute the Pearson correlation coefficient between the two variables. The null hypothesis is that the true correlation is zero. We can reject the null hypothesis and conclude that there is a relationship between vehicle weight and horsepower if the p-value is less than the significance level.

cor.test(rawdata$Weight..lbs., rawdata$Horsepower..HP., method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  rawdata$Weight..lbs. and rawdata$Horsepower..HP.
## t = 21.81, df = 382, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6965565 0.7862007
## sample estimates:
##       cor 
## 0.7447192

Output: The p-value (< 2.2e-16) is far below the significance level, so we reject the null hypothesis of zero correlation. The estimated correlation of 0.74 indicates a fairly strong positive linear relationship between vehicle weight and horsepower.
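
As a check on the cor.test() output, the t statistic of a Pearson correlation test can be recomputed from the correlation and the degrees of freedom alone, using t = r * sqrt(df) / sqrt(1 - r^2). The sketch below plugs in the values reported above; the variable names are illustrative.

```r
# Values reported by cor.test() above
r  <- 0.7447192
df <- 382                      # n - 2 with n = 384 rows

# Test statistic for H0: true correlation = 0
t_stat  <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df)   # two-sided p-value

print(t_stat)            # should be close to the reported t = 21.81
print(p_value < 0.05)    # TRUE means the null hypothesis is rejected
```

A p-value below the significance level leads to rejecting the null hypothesis, which is the decision rule stated before the test.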

Step 14 : Hypothesis test 2

To determine whether the mean of Real.World.CO2_City..g.mi. is less than 400 grams per mile, we use a one-sided t-test. The null hypothesis states that the mean city CO2 emission is greater than or equal to 400 grams per mile, while the alternative hypothesis states that the mean city CO2 emission is less than 400 grams per mile.

t.test(rawdata$Real.World.CO2_City..g.mi., mu = 400 , alternative = "less" ) 
## 
##  One Sample t-test
## 
## data:  rawdata$Real.World.CO2_City..g.mi.
## t = 23.499, df = 383, p-value = 1
## alternative hypothesis: true mean is less than 400
## 95 percent confidence interval:
##      -Inf 537.6543
## sample estimates:
## mean of x 
##  528.6287

Output : Because the p-value (1) is greater than 0.05, we cannot reject the null hypothesis. There is no evidence that the mean of Real.World.CO2_City..g.mi. is below 400 grams per mile. The sample mean is 528.6 grams per mile, which is well above 400.
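
The direction of a one-sided test matters a great deal for the p-value. The sketch below recomputes the t statistic from the mean (528.6287), the standard deviation (107.2658), and the sample size (384) reported in this document, and then shows the p-value for both possible one-sided alternatives. The variable names are illustrative.

```r
# Summary values reported in this document for Real.World.CO2_City..g.mi.
mean_x <- 528.6287
sd_x   <- 107.2658
n      <- 384
mu0    <- 400        # hypothesized mean

t_stat <- (mean_x - mu0) / (sd_x / sqrt(n))

p_less    <- pt(t_stat, df = n - 1)                      # alternative = "less"
p_greater <- pt(t_stat, df = n - 1, lower.tail = FALSE)  # alternative = "greater"

print(t_stat)        # should be close to the reported t = 23.499
print(c(less = p_less, greater = p_greater))
```

Because the sample mean lies above 400, the "less" alternative gives a p-value of essentially 1, while the "greater" alternative would give a p-value near 0.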

Step 15 : Conduct a two-tailed test to see whether the fuel efficiency of Car SUVs is equal to that of Truck SUVs.

H0 = mean of CarSUV = mean of TruckSUV

H1 = mean of CarSUV ≠ mean of TruckSUV

CarSUV <- rawdata$Real.World.MPG[rawdata$Vehicle.Type == "Car SUV"]
TruckSUV <- rawdata$Real.World.MPG[rawdata$Vehicle.Type == "Truck SUV"]
print(M1 <- mean(CarSUV))
## [1] 20.22622
print(M2 <- mean(TruckSUV))
## [1] 17.44931
standard_deviation1 <- sd(CarSUV)
print(paste('Standard Deviation of Car SUV: ', standard_deviation1))
## [1] "Standard Deviation of Car SUV:  4.76605239040631"
standard_deviation2 <- sd(TruckSUV)
print(paste('Standard Deviation of TruckSUV: ', standard_deviation2))
## [1] "Standard Deviation of TruckSUV:  3.43779399551164"
n<-384
StandardError <- sqrt((standard_deviation1^2/n) + (standard_deviation2^2/n))
StandardError
## [1] 0.2998858
A1 <- M1
A2 <- M2
tstat <- (A1-A2)/StandardError
alpha = 0.05 # as two tailed test we are dividing alpha value 0.05/2 = 0.025
zscore <- 1.96
tstat
## [1] 9.259889
print(dof <- (n+n)-2)
## [1] 766
print(p_value <- 2 * pt(-abs(tstat), dof))
## [1] 2.014926e-19

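Output : The null hypothesis is rejected because the p-value is smaller than the alpha value and the t value (9.26) is greater than the critical value of 1.96. Because it is a two-tailed test, we reject the null hypothesis if the statistic is less than -1.96 or more than 1.96. So the fuel efficiency of Car SUVs does not equal that of Truck SUVs. Note, however, that n is set to 384, the number of rows in the whole dataset, rather than the number of rows in each group, so the standard error is understated and the t value is overstated.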
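
A minimal sketch of the same two-sample comparison with the group sizes taken from the groups themselves is given below. The MPG values are made up and the group sizes are unequal on purpose. The manual calculation is compared with t.test(), which performs a Welch two-sample t-test by default.

```r
# Toy MPG values for two groups (made-up, not the EPA data)
car_suv   <- c(20.1, 22.4, 19.8, 25.3, 27.0, 30.2)
truck_suv <- c(16.2, 17.5, 15.9, 18.8, 19.4)

n1 <- length(car_suv)     # group sizes come from the groups themselves
n2 <- length(truck_suv)

# Standard error of the difference in means (unequal variances)
se       <- sqrt(var(car_suv) / n1 + var(truck_suv) / n2)
t_manual <- (mean(car_suv) - mean(truck_suv)) / se

res <- t.test(car_suv, truck_suv)   # Welch two-sample t-test (default)
print(c(manual = t_manual, builtin = unname(res$statistic)))
print(res$p.value)
```

On the real data, t.test(CarSUV, TruckSUV) would apply the same calculation and also handle the degrees of freedom, so n, the standard error, and the p-value do not have to be entered by hand.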

Exploratory Data Analysis

summary(rawdata)
##   Model.Year        Regulatory.Class   Vehicle.Type       Production.Share
##  Length:384         Length:384         Length:384         1       : 47    
##  Class :character   Class :character   Class :character   -       :  8    
##  Mode  :character   Mode  :character   Mode  :character   0.00002 :  1    
##                                                           0.000032:  1    
##                                                           0.000925:  1    
##                                                           0.001001:  1    
##                                                           (Other) :325    
##  Real.World.MPG  Real.World.MPG_City Real.World.MPG_Hwy Real.World.CO2..g.mi.
##  Min.   :10.53   Min.   : 9.393      Min.   :10.81      Min.   :254.0        
##  1st Qu.:17.03   1st Qu.:15.001      1st Qu.:19.33      1st Qu.:386.2        
##  Median :19.38   Median :16.898      Median :22.54      Median :458.9        
##  Mean   :20.00   Mean   :17.491      Mean   :22.91      Mean   :466.6        
##  3rd Qu.:23.02   3rd Qu.:19.755      3rd Qu.:27.02      3rd Qu.:522.6        
##  Max.   :33.71   Max.   :29.253      Max.   :38.19      Max.   :844.0        
##                                                                              
##  Real.World.CO2_City..g.mi. Real.World.CO2_Hwy..g.mi.  Weight..lbs. 
##  Min.   :291.7              Min.   :222.7             Min.   :2630  
##  1st Qu.:449.7              1st Qu.:328.6             1st Qu.:3536  
##  Median :526.5              Median :395.2             Median :3991  
##  Mean   :528.6              Mean   :410.4             Mean   :3987  
##  3rd Qu.:592.6              3rd Qu.:459.9             3rd Qu.:4415  
##  Max.   :946.2              Max.   :822.0             Max.   :5485  
##                                                                     
##  Horsepower..HP.  Footprint..sq..ft..
##  Min.   : 87.81   -       :264       
##  1st Qu.:138.15   44.92996:  1       
##  Median :178.48   45.04628:  1       
##  Mean   :183.10   45.20013:  1       
##  3rd Qu.:215.07   45.21904:  1       
##  Max.   :345.67   45.31546:  1       
##                   (Other) :115

Step 16 : Calculate summary statistics: mean, median, mode, standard deviation, variance, and range.

Mean

Mean_Real.World.MPG <- rawdata$Real.World.MPG
mean(Mean_Real.World.MPG)
## [1] 19.99682
Mean_Real.World.MPG_City <- rawdata$Real.World.MPG_City
mean(Mean_Real.World.MPG_City)
## [1] 17.49105
Mean_Real.World.MPG_Hwy <- rawdata$Real.World.MPG_Hwy
mean(Mean_Real.World.MPG_Hwy)
## [1] 22.91446
Mean_Real.World.CO2..g.mi. <- rawdata$Real.World.CO2..g.mi.
mean(Mean_Real.World.CO2..g.mi.)
## [1] 466.6172
Mean_Real.World.CO2_Hwy..g.mi. <- rawdata$Real.World.CO2_Hwy..g.mi.
mean(Mean_Real.World.CO2_Hwy..g.mi.)
## [1] 410.4031
Mean_Real.World.CO2_City..g.mi. <- rawdata$Real.World.CO2_City..g.mi.
mean(Mean_Real.World.CO2_City..g.mi.)
## [1] 528.6287
Mean_Weight..lbs.<- rawdata$Weight..lbs.
mean(Mean_Weight..lbs.)
## [1] 3987.069
Mean_Horsepower..HP. <- rawdata$Horsepower..HP.
mean(Mean_Horsepower..HP.)
## [1] 183.1011

Median

Median_Horsepower..HP. <- rawdata$Horsepower..HP.
median(Median_Horsepower..HP.)
## [1] 178.4841
Median_Real.World.MPG <- rawdata$Real.World.MPG
median(Median_Real.World.MPG)
## [1] 19.37544
Median_Real.World.CO2..g.mi. <- rawdata$Real.World.CO2..g.mi.
median(Median_Real.World.CO2..g.mi.)
## [1] 458.931
Median_Weight..lbs <- rawdata$Weight..lbs
median(Median_Weight..lbs)
## [1] 3991.068
Median_Real.World.MPG_City <- rawdata$Real.World.MPG_City
median(Median_Real.World.MPG_City)
## [1] 16.89832
Median_Real.World.MPG_Hwy <- rawdata$Real.World.MPG_Hwy
median(Median_Real.World.MPG_Hwy)
## [1] 22.53625
Median_Real.World.CO2_City..g.mi. <- rawdata$Real.World.CO2_City..g.mi.
median(Median_Real.World.CO2_City..g.mi.)
## [1] 526.5415
Median_Real.World.CO2_Hwy..g.mi. <- rawdata$Real.World.CO2_Hwy..g.mi.
median(Median_Real.World.CO2_Hwy..g.mi.)
## [1] 395.231

Range

range(rawdata$Real.World.CO2_Hwy..g.mi.)
## [1] 222.7413 821.9988
range(rawdata$Real.World.CO2_City..g.mi.)
## [1] 291.7420 946.1582
range(rawdata$Real.World.MPG_Hwy)
## [1] 10.81307 38.19438
range(rawdata$Real.World.MPG_City)
## [1]  9.39272 29.25306
range(rawdata$Weight..lbs.)
## [1] 2629.999 5484.824
range(rawdata$Real.World.CO2..g.mi)
## [1] 253.9547 844.0170
range(rawdata$Real.World.MPG)
## [1] 10.53097 33.71184
range(rawdata$Horsepower..HP.)
## [1]  87.8139 345.6733

Standard Deviation

sd(rawdata$Real.World.CO2_Hwy..g.mi.)
## [1] 104.3922
sd(rawdata$Real.World.CO2_City..g.mi.)
## [1] 107.2658
sd(rawdata$Real.World.MPG_Hwy)
## [1] 5.274994
sd(rawdata$Real.World.MPG_City)
## [1] 3.51826
sd(rawdata$Weight..lbs.)
## [1] 549.3396
sd(rawdata$Real.World.CO2..g.mi)
## [1] 107.4801
sd(rawdata$Real.World.MPG)
## [1] 4.374913
sd(rawdata$Horsepower..HP.)
## [1] 55.78506

Variance

var(rawdata$Real.World.CO2_Hwy..g.mi.)
## [1] 10897.72
var(rawdata$Real.World.CO2_City..g.mi.)
## [1] 11505.96
var(rawdata$Real.World.MPG_Hwy)
## [1] 27.82556
var(rawdata$Real.World.MPG_City)
## [1] 12.37815
var(rawdata$Weight..lbs.)
## [1] 301774
var(rawdata$Real.World.CO2..g.mi)
## [1] 11551.96
var(rawdata$Real.World.MPG)
## [1] 19.13986
var(rawdata$Horsepower..HP.)
## [1] 3111.973

Mode

HP <- rawdata$Horsepower..HP.
MPG <-rawdata$Real.World.MPG
World_CO2 <- rawdata$Real.World.CO2..g.mi
weight <- rawdata$Weight..lbs.
World.MPG_Hwy <- rawdata$Real.World.MPG_Hwy
World.CO2_Hwy <- rawdata$Real.World.CO2_Hwy..g.mi.
Mode <- function(x){
  ux <- unique(x)
  ux[which.max(tabulate(match(x,ux)))]
}
Mode(HP)
## [1] 137.3346
Mode(MPG)
## [1] 13.0597
Mode(World_CO2)
## [1] 680.5961
Mode(weight)
## [1] 4000
Mode(World.MPG_Hwy)
## [1] 14.61167
Mode(World.CO2_Hwy)
## [1] 608.3116
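
For continuous measurements the mode is often not informative, because most values occur only once. The sketch below repeats the Mode() definition from above and shows how it behaves when no value repeats and when two values are tied. The input vectors and the max_freq helper are made up for illustration.

```r
# Same definition as the Mode() function above
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

all_unique <- c(3.2, 1.5, 2.7)   # no repeated values
with_tie   <- c(1, 2, 2, 3, 3)   # 2 and 3 both appear twice

print(Mode(all_unique))   # the first element, since every count is 1
print(Mode(with_tie))     # the first of the tied values

# Hypothetical helper: highest frequency of any value;
# a result of 1 means there is no meaningful mode
max_freq <- function(x) max(table(x))
print(max_freq(all_unique))
```

which.max() returns the first position of the maximum, so whenever Mode() returns the first value of a column, no other value in that column occurs more often than the first one.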

Interquartile range (IQR)

IQR(rawdata$Real.World.MPG)
## [1] 5.991398
IQR(rawdata$Real.World.MPG_City)
## [1] 4.753562
IQR(rawdata$Real.World.MPG_Hwy)
## [1] 7.690288
IQR(rawdata$Real.World.CO2..g.mi.)
## [1] 136.4557
# Based on the quartiles reported by summary(), this IQR is roughly 459.9 - 328.6 = 131.3
IQR(rawdata$Real.World.CO2_Hwy..g.mi.)
IQR(rawdata$Real.World.CO2_City..g.mi.)
## [1] 142.9086
IQR(rawdata$Weight..lbs.)
## [1] 878.8207
IQR(rawdata$Horsepower..HP.)
## [1] 76.91698
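
The statistics above are computed one column at a time. A more compact alternative, sketched below under the assumption that a single table of results is wanted, applies all functions to every numeric column at once. The data frame toy and the helper describe_numeric are made-up names for illustration.

```r
# Toy data frame (made-up values); 'type' is non-numeric on purpose
toy <- data.frame(
  type = c("Car", "Car", "Truck", "Truck"),
  mpg  = c(13.5, 24.0, 11.6, 20.0),
  hp   = c(136, 160, 142, 200)
)

# Hypothetical helper: one column of statistics per numeric variable
describe_numeric <- function(df) {
  nums <- df[sapply(df, is.numeric)]   # keep numeric columns only
  sapply(nums, function(x) c(
    mean   = mean(x),
    median = median(x),
    sd     = sd(x),
    var    = var(x),
    min    = min(x),
    max    = max(x),
    IQR    = IQR(x)
  ))
}

stats_table <- describe_numeric(toy)
print(round(stats_table, 3))
```

The result is a matrix with one row per statistic and one column per variable, which avoids mistyped column names in repeated calls.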

Check which variables are numeric and which are categorical

str(rawdata)
## 'data.frame':    384 obs. of  13 variables:
##  $ Model.Year                : chr  "1975" "1975" "1975" "1975" ...
##  $ Regulatory.Class          : chr  "All" "Car" "Car" "Truck" ...
##  $ Vehicle.Type              : chr  "All" "All Car" "Sedan/Wagon" "All Truck" ...
##  $ Production.Share          : Factor w/ 331 levels "-","0.00002",..: 331 326 325 173 129 60 22 5 331 320 ...
##  $ Real.World.MPG            : num  13.1 13.5 13.5 11.6 11.9 ...
##  $ Real.World.MPG_City       : num  12 12.3 12.3 10.9 11.1 ...
##  $ Real.World.MPG_Hwy        : num  14.6 15.2 15.2 12.7 13.1 ...
##  $ Real.World.CO2..g.mi.     : num  681 661 660 764 746 ...
##  $ Real.World.CO2_City..g.mi.: num  740 722 722 814 802 ...
##  $ Real.World.CO2_Hwy..g.mi. : num  608 586 586 702 677 ...
##  $ Weight..lbs.              : num  4060 4057 4058 4073 4012 ...
##  $ Horsepower..HP.           : num  137 136 136 142 141 ...
##  $ Footprint..sq..ft..       : Factor w/ 121 levels "-","44.92996",..: 1 1 1 1 1 1 1 1 1 1 ...

Output: As the output above shows, there are three character variables, two factors, and eight numeric variables.

Step 17 : Extract the numeric variables

numeric_dataset <- function(Dataset){
  nums <- sapply(Dataset, is.numeric)
  return(Dataset[ , nums])
}
automotive_num <- numeric_dataset(rawdata)
str(automotive_num) 
## 'data.frame':    384 obs. of  8 variables:
##  $ Real.World.MPG            : num  13.1 13.5 13.5 11.6 11.9 ...
##  $ Real.World.MPG_City       : num  12 12.3 12.3 10.9 11.1 ...
##  $ Real.World.MPG_Hwy        : num  14.6 15.2 15.2 12.7 13.1 ...
##  $ Real.World.CO2..g.mi.     : num  681 661 660 764 746 ...
##  $ Real.World.CO2_City..g.mi.: num  740 722 722 814 802 ...
##  $ Real.World.CO2_Hwy..g.mi. : num  608 586 586 702 677 ...
##  $ Weight..lbs.              : num  4060 4057 4058 4073 4012 ...
##  $ Horsepower..HP.           : num  137 136 136 142 141 ...

Output: Only the eight numeric variables remain in automotive_num.
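
One caveat of selecting columns with Dataset[ , nums] is that a plain data frame collapses to a vector when only one column is selected. The sketch below shows a variant that always returns a data frame. The name numeric_dataset_safe and the toy data are assumptions made for illustration.

```r
# Variant of numeric_dataset() that always returns a data frame
numeric_dataset_safe <- function(dataset) {
  nums <- sapply(dataset, is.numeric)
  dataset[, nums, drop = FALSE]   # drop = FALSE keeps the data frame structure
}

# Toy data frame with a single numeric column (made-up values)
toy <- data.frame(type = c("Car", "Truck"), mpg = c(13.5, 11.6))

unsafe <- toy[, sapply(toy, is.numeric)]   # collapses to a plain numeric vector
safe   <- numeric_dataset_safe(toy)        # stays a one-column data frame

print(class(unsafe))
print(class(safe))
```

With eight numeric columns, as in this dataset, both versions give the same result. The difference only appears when a single numeric column is left.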

Step 18 : Check the first six rows of the numeric data.

head(automotive_num)
##   Real.World.MPG Real.World.MPG_City Real.World.MPG_Hwy Real.World.CO2..g.mi.
## 1       13.05970            12.01552           14.61167              680.5961
## 2       13.45483            12.31413           15.17266              660.6374
## 3       13.45833            12.31742           15.17643              660.4660
## 4       11.63431            10.91165           12.65900              763.8613
## 5       11.91476            11.07827           13.12613              745.8814
## 6       11.10606            10.55642           11.86084              800.1940
##   Real.World.CO2_City..g.mi. Real.World.CO2_Hwy..g.mi. Weight..lbs.
## 1                   739.7380                  608.3116     4060.399
## 2                   721.8293                  585.8472     4057.494
## 3                   721.6367                  585.7019     4057.565
## 4                   814.4506                  702.0300     4072.518
## 5                   802.2009                  677.0464     4011.977
## 6                   841.8573                  749.2722     4195.690
##   Horsepower..HP.
## 1        137.3346
## 2        136.1964
## 3        136.2256
## 4        142.0826
## 5        140.9365
## 6        143.2245

Output: The first six rows of the numeric data are shown above.

Step 19 : Determine the mean, median, standard deviation, and range of each numeric variable by Vehicle.Type.

Aggregate Mean

aggregate_mean <- aggregate(automotive_num, by = list(rawdata$Vehicle.Type), FUN = mean)
colnames(aggregate_mean)[1] <- "Vehicle.Type"
dim(aggregate_mean)
## [1] 8 9
aggregate_mean
##   Vehicle.Type Real.World.MPG Real.World.MPG_City Real.World.MPG_Hwy
## 1          All       21.05023            18.36558           24.21940
## 2      All Car       23.80026            20.58441           27.65969
## 3    All Truck       17.77338            15.69641           20.12641
## 4      Car SUV       20.22622            17.84962           22.93543
## 5  Minivan/Van       18.45708            16.06780           21.21303
## 6       Pickup       17.09726            15.14403           19.32361
## 7  Sedan/Wagon       24.12080            20.80189           28.11248
## 8    Truck SUV       17.44931            15.41869           19.72568
##   Real.World.CO2..g.mi. Real.World.CO2_City..g.mi. Real.World.CO2_Hwy..g.mi.
## 1              431.3862                   492.3273                  375.8517
## 2              385.4366                   443.0676                  332.2874
## 3              511.9602                   576.1101                  454.6119
## 4              465.4647                   524.2578                  410.7897
## 5              500.2546                   566.8536                  440.8516
## 6              527.1617                   593.8871                  468.0765
## 7              381.3337                   439.1042                  328.0781
## 8              529.9395                   593.4221                  472.6779
##   Weight..lbs. Horsepower..HP.
## 1     3780.094        176.0955
## 2     3414.820        159.5193
## 3     4356.861        199.8513
## 4     3738.246        167.2514
## 5     4328.552        189.1294
## 6     4481.193        215.5737
## 7     3386.146        158.9195
## 8     4410.642        198.4687

Aggregate Standard Deviation

aggregate_Standard_Deviation <- aggregate(automotive_num, by = list(rawdata$Vehicle.Type), FUN = sd)
colnames(aggregate_Standard_Deviation)[1] <- "Vehicle.Type"
head(aggregate_Standard_Deviation)
##   Vehicle.Type Real.World.MPG Real.World.MPG_City Real.World.MPG_Hwy
## 1          All       2.896313            2.305560           3.427399
## 2      All Car       4.069109            3.272402           4.700826
## 3    All Truck       2.611033            2.012260           3.202051
## 4      Car SUV       4.766052            4.033952           5.377006
## 5  Minivan/Van       3.439250            2.506315           4.393650
## 6       Pickup       1.825319            1.542381           2.268045
##   Real.World.CO2..g.mi. Real.World.CO2_City..g.mi. Real.World.CO2_Hwy..g.mi.
## 1              70.33875                   69.91426                  66.41512
## 2              78.08117                   78.64341                  71.88232
## 3              83.04248                   77.73747                  84.51645
## 4             124.29870                  131.47756                 110.86445
## 5             107.65964                   95.28129                 112.57889
## 6              65.51111                   64.48457                  67.31921
##   Weight..lbs. Horsepower..HP.
## 1     346.2237        49.73348
## 2     252.1981        38.06582
## 3     383.0226        60.31795
## 4     255.7871        39.51241
## 5     185.8731        47.97849
## 6     643.3320        82.74474

Aggregate Variance

aggregate_var <- aggregate(automotive_num, by = list(rawdata$Vehicle.Type), FUN = var)
colnames(aggregate_var)[1] <- "Vehicle.Type"
head(aggregate_var)
##   Vehicle.Type Real.World.MPG Real.World.MPG_City Real.World.MPG_Hwy
## 1          All       8.388631            5.315607          11.747064
## 2      All Car      16.557648           10.708616          22.097765
## 3    All Truck       6.817493            4.049190          10.253131
## 4      Car SUV      22.715255           16.272772          28.912195
## 5  Minivan/Van      11.828443            6.281613          19.304160
## 6       Pickup       3.331790            2.378940           5.144027
##   Real.World.CO2..g.mi. Real.World.CO2_City..g.mi. Real.World.CO2_Hwy..g.mi.
## 1              4947.539                   4888.004                  4410.969
## 2              6096.669                   6184.786                  5167.068
## 3              6896.053                   6043.115                  7143.031
## 4             15450.168                  17286.348                 12290.926
## 5             11590.599                   9078.524                 12674.007
## 6              4291.706                   4158.260                  4531.876
##   Weight..lbs. Horsepower..HP.
## 1    119870.83        2473.419
## 2     63603.90        1449.007
## 3    146706.34        3638.255
## 4     65427.03        1561.230
## 5     34548.82        2301.936
## 6    413876.12        6846.692

Aggregate Mode

aggregate_Mode <- aggregate(automotive_num, by = list(rawdata$Vehicle.Type), FUN = mode)
colnames(aggregate_Mode)[1] <- "Vehicle.Type"
head(aggregate_Mode)
##   Vehicle.Type Real.World.MPG Real.World.MPG_City Real.World.MPG_Hwy
## 1          All        numeric             numeric            numeric
## 2      All Car        numeric             numeric            numeric
## 3    All Truck        numeric             numeric            numeric
## 4      Car SUV        numeric             numeric            numeric
## 5  Minivan/Van        numeric             numeric            numeric
## 6       Pickup        numeric             numeric            numeric
##   Real.World.CO2..g.mi. Real.World.CO2_City..g.mi. Real.World.CO2_Hwy..g.mi.
## 1               numeric                    numeric                   numeric
## 2               numeric                    numeric                   numeric
## 3               numeric                    numeric                   numeric
## 4               numeric                    numeric                   numeric
## 5               numeric                    numeric                   numeric
## 6               numeric                    numeric                   numeric
##   Weight..lbs. Horsepower..HP.
## 1      numeric         numeric
## 2      numeric         numeric
## 3      numeric         numeric
## 4      numeric         numeric
## 5      numeric         numeric
## 6      numeric         numeric
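
Every cell in the table above reads `numeric` because base R's `mode()` reports the storage mode of an object, not its most frequent value. A minimal sketch of a statistical mode helper follows; the name `stat_mode` and the toy data frame are assumptions for illustration and are not part of the original analysis.

```r
# Hypothetical helper: most frequent value of a vector (first one wins on ties).
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Toy data standing in for one numeric column grouped by vehicle type.
toy <- data.frame(
  type = c("A", "A", "A", "B", "B", "B"),
  mpg  = c(20, 20, 25, 18, 19, 19)
)
toy_mode <- aggregate(toy["mpg"], by = list(Vehicle.Type = toy$type), FUN = stat_mode)
toy_mode
```

For continuous measurements such as MPG, where values rarely repeat exactly, the mode is usually less informative than the median.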

Aggregate Median

aggregate_Median <- aggregate(automotive_num, by = list(rawdata$Vehicle.Type), FUN = median)
colnames(aggregate_Median)[1] <- "Vehicle.Type"
head(aggregate_Median)
##   Vehicle.Type Real.World.MPG Real.World.MPG_City Real.World.MPG_Hwy
## 1          All       20.96036            18.74020           24.21608
## 2      All Car       23.15597            19.95090           27.44609
## 3    All Truck       17.40234            15.67611           19.77885
## 4      Car SUV       19.36475            17.10980           22.23017
## 5  Minivan/Van       18.30902            16.02602           21.43195
## 6       Pickup       17.32420            15.17568           19.82763
##   Real.World.CO2..g.mi. Real.World.CO2_City..g.mi. Real.World.CO2_Hwy..g.mi.
## 1              425.0867                   474.9688                  367.0256
## 2              383.8861                   445.8605                  323.8340
## 3              511.8291                   567.1100                  449.4054
## 4              458.9310                   519.4158                  399.7855
## 5              485.3905                   554.5356                  414.6642
## 6              513.0584                   585.7171                  448.2758
##   Weight..lbs. Horsepower..HP.
## 1     3896.740        175.1865
## 2     3460.863        161.9397
## 3     4407.449        194.0311
## 4     3805.244        171.6564
## 5     4373.051        181.3874
## 6     4377.310        199.1631

Output : The calculations show that a car SUV gets fewer miles per gallon on average than a sedan, and that the SUV has higher values for CO2 emissions, weight, and horsepower. The emission, weight, and horsepower variables also show a much higher variance than MPG. However, further research is needed before drawing any firm conclusions. The data suggest that sedans and wagons get better mileage than car SUVs in both city and highway driving and release fewer emissions.

Step 20: Relation between efficiency and horsepower of vehicles across vehicle types

ggplot(rawdata, aes(x=Horsepower..HP., y = Real.World.MPG, group=Vehicle.Type, fill=Vehicle.Type, color=Vehicle.Type)) + stat_summary(fun = sum, na.rm=TRUE, geom='line')+ ggtitle('Vehicle efficiency against Power')+theme(axis.text.x=element_blank(),axis.ticks.x=element_blank())

Output: The plot shows that, past a certain horsepower, the miles per gallon of a car SUV rose dramatically. This runs against the general expectation that miles per gallon decline as horsepower and weight increase.

Step 21: Relation between efficiency and vehicle weight across vehicle types

ggplot(rawdata, aes(x=Weight..lbs., y = Real.World.MPG, group=Vehicle.Type, fill=Vehicle.Type, color=Vehicle.Type)) + stat_summary(fun = sum, na.rm=TRUE, geom='line')+ ggtitle('Vehicle efficiency against Weight')+theme(axis.text.x=element_blank(),axis.ticks.x=element_blank())

Output : The plot shows that, beyond a certain weight, additional pounds have a significant impact on a car SUV’s MPG. In general, miles per gallon decrease as weight increases.

Step 22: Plots are used to visualize the variables and the relationships between them.

Plot a histogram of each variable

dataframe <- data.frame(
  MPG = rawdata$Real.World.MPG,
  CO2 = rawdata$Real.World.CO2..g.mi.,
  W = rawdata$Weight..lbs.,
  HP = rawdata$Horsepower..HP.
  )
par(mfrow=c(1,4)) # set up a 1x4 grid of plots
for(i in 1:ncol(dataframe)) {
  # title each histogram with the name of the column it shows
  hist(dataframe[,i], main=colnames(dataframe)[i], xlab="", col="pink")
}

Output : Weight and horsepower are roughly normally distributed, while both MPG and CO2 have a right-skewed distribution.
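
A visual reading of skew can be backed by a number. The sketch below computes the sample skewness (positive values mean a longer right tail) on simulated data; the helper name `skewness` and the simulated vectors are assumptions for illustration, and in the analysis the function would be applied to the columns of `dataframe`.

```r
# Hypothetical helper: moment-based sample skewness.
# Values near 0 suggest symmetry; positive values suggest right skew.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(1)
symmetric    <- rnorm(1000)  # roughly symmetric sample
right_skewed <- rexp(1000)   # sample with a long right tail

c(symmetric = skewness(symmetric), right_skewed = skewness(right_skewed))
```

With the real data, `sapply(dataframe, skewness)` would give one value per histogram.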

Step 23: Plot the efficiency of the distinct vehicle classes

ggplot(rawdata, aes(x=Vehicle.Type, y=Real.World.MPG)) + 
  geom_boxplot(fill="grey", color="blue") + 
  ggtitle("Fuel economy by vehicle class") +
  ylab("Miles per gallon")

Output : The plots show a fuel economy of about 21 MPG across all vehicle categories. Sedans clearly have higher fuel efficiency than SUVs and trucks, with trucks having the lowest efficiency.

Step 24: Vehicle efficiency over time

ggplot(data =  rawdata) + geom_point(mapping = aes(x=Model.Year, y = Real.World.MPG, color = Vehicle.Type), position = "jitter", alpha = 0.2, show.legend = FALSE) + facet_wrap(~ Vehicle.Type)

Output: All vehicle types now run more efficiently than in the past, and efficiency is generally on the rise. Furthermore, sedans have consistently had higher MPG over the years than the other vehicle types.

Step 25: Examine and plot the relationship between vehicle efficiency and weight.

ggplot(data = rawdata) + geom_point(mapping = aes(x=Weight..lbs., y = Real.World.MPG), position = "jitter", alpha = 0.5) + geom_smooth(mapping = aes(x=Weight..lbs., y = Real.World.MPG)) + labs(x = "Weight (lbs)", y = "Miles per Gallon", title = "Miles per gallon and weight correlation")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Output: According to the Environmental Protection Agency, a car’s fuel economy improves by 1-2% for every 100 pounds removed from it. Surprisingly, MPG first increased with vehicle weight up to about 3,500 lbs. Beyond that point, efficiency dropped as weight increased, as expected.

Step 26: Emissions trend over time

ggplot(rawdata, aes(x=Model.Year, y=Real.World.CO2..g.mi., group=Vehicle.Type, fill=Vehicle.Type, color=Vehicle.Type)) + stat_summary(fun = sum, na.rm=TRUE, geom='line')+ ggtitle('Emission for vehicle types across years')+theme(axis.text.x=element_blank(),axis.ticks.x=element_blank()) +theme(axis.text.y=element_blank(),axis.ticks.y=element_blank())

Output : For all vehicle classes, emissions show a declining trend over the years, with sedans and wagons having the lowest levels. This suggests that automakers are developing more fuel-efficient vehicles that produce fewer emissions. The exploratory data analysis also suggests that a vehicle operates most efficiently with the right balance of weight, horsepower, and additional factors such as engine condition and aerodynamics, which are not included in this data set. Choosing a fuel-efficient car is crucial for keeping the environment cleaner and reducing pollution.

Step 27: Relation between efficiency and horsepower

ggplot(data = rawdata) + geom_point(mapping = aes(x=Horsepower..HP., y = Real.World.MPG), position = "jitter", alpha = 0.1) + geom_smooth(mapping = aes(x=Horsepower..HP., y = Real.World.MPG)) + labs(x = "HP", y = "Miles per Gallon", title = "Vehicle efficiency and horsepower")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Output : Horsepower and miles per gallon are generally expected to be inversely related, although the relationship can be affected by variables such as engine type, transmission, and road conditions.

Step 28: Relation between variables Weight and Horsepower

ggplot(data = rawdata) + geom_point(mapping = aes(x=Horsepower..HP., y = Weight..lbs.), position = "jitter", alpha = 0.1) + geom_smooth(mapping = aes(x=Horsepower..HP., y = Weight..lbs.)) + labs(x = "HP", y = "Weight(lbs)", title = "Relation between Weight and Power")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Output : Horsepower is directly related to vehicle weight: the heavier the vehicle, the more power is needed to move it.

Step 29 : A historical analysis of US car production patterns

ggplot(rawdata, aes(x=Model.Year, y=Production.Share, group=Vehicle.Type, fill=Vehicle.Type, color=Vehicle.Type)) + stat_summary(fun = sum, na.rm=TRUE, geom='line')+ ggtitle('Production percentage for various vehicle kinds over time')+theme(axis.text.x=element_blank(),axis.ticks.x=element_blank()) +theme(axis.text.y=element_blank(),axis.ticks.y=element_blank())

Output : The production shares of trucks and SUVs have increased significantly over time. Sedan production has been slowly declining over the years while car SUV production has been rising, yet sedans have consistently been produced in greater volume than car SUVs.

Step 30 : Relation between MPG and CO2 emission

ggplot(data = rawdata) + geom_point(mapping = aes(x=Real.World.CO2..g.mi., y = Real.World.MPG), position = "jitter", alpha = 0.1) + geom_smooth(mapping = aes(x=Real.World.CO2..g.mi., y = Real.World.MPG)) + labs(x = "CO2 emission (g/mi)", y = "Miles per Gallon", title = "Fuel efficiency and CO2 emissions")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Output : A car with a higher MPG burns less fuel per mile and therefore emits less CO2, whereas a car with a lower MPG burns more fuel and emits more. The two variables therefore have an inverse relationship.
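
The inverse relationship can be made concrete with a small calculation. The sketch below assumes a gasoline-only vehicle and the EPA conversion figure of 8,887 grams of CO2 per gallon of gasoline burned; the function name `co2_g_per_mile` is hypothetical, and diesel, hybrid, or electric vehicles in the dataset would not follow this exact curve.

```r
# Assumption: gasoline only, 8,887 g of CO2 emitted per gallon burned (EPA figure).
GRAMS_CO2_PER_GALLON <- 8887

# Hypothetical helper: tailpipe CO2 in grams per mile for a given fuel economy.
co2_g_per_mile <- function(mpg) GRAMS_CO2_PER_GALLON / mpg

mpg <- c(15, 20, 25, 30)
data.frame(mpg = mpg, co2_g_mi = round(co2_g_per_mile(mpg), 1))
```

Because CO2 per mile is proportional to 1 / MPG, the curve is hyperbolic rather than linear, which fits the curved shape a loess smoother tends to draw through such data.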

Step 31 : Check the correlation

plot(rawdata)

Output: The scatterplot matrix above shows the pairwise relationships between the numeric variables.
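
A scatterplot matrix shows the shape of each relationship but not its strength. A minimal sketch of computing the correlation coefficients directly is given below; it uses the built-in `mtcars` data as a stand-in for `rawdata`, so the column names are assumptions and would be replaced by the numeric columns of the real dataset.

```r
# mtcars is a built-in dataset used here only as a stand-in for rawdata.
num_cols <- mtcars[, c("mpg", "wt", "hp")]

# Pearson correlations between every pair of numeric columns,
# ignoring rows that contain missing values.
cor_matrix <- cor(num_cols, use = "complete.obs")
round(cor_matrix, 2)
```

Calling `pairs(num_cols)` on the same subset draws the scatterplot matrix without the non-numeric columns.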

Step 32 : Let’s use regression analysis to examine the dataset further.

Create a plot to explore the relationship between weight and real-world MPG

ggplot(rawdata, aes(x= Real.World.MPG,y=Weight..lbs.))  + geom_smooth(method = "lm",color="pink")+ggtitle("weight vs Real.World.MPG ")
## `geom_smooth()` using formula = 'y ~ x'

Output : We can observe from the graph above that there is a negative linear relationship between Real.World.MPG and weight.

Fit a simple linear regression model

LR_Model_Fit <- lm(Weight..lbs.~Real.World.MPG , data = rawdata)
summary(LR_Model_Fit)
## 
## Call:
## lm(formula = Weight..lbs. ~ Real.World.MPG, data = rawdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1368.70  -355.98   -68.04   352.95  1398.64 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4998.77     120.35  41.534   <2e-16 ***
## Real.World.MPG   -50.59       5.88  -8.604   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 503.4 on 382 degrees of freedom
## Multiple R-squared:  0.1623, Adjusted R-squared:  0.1602 
## F-statistic: 74.04 on 1 and 382 DF,  p-value: < 2.2e-16

Output: The fitted regression line is Weight = 4998.77 - 50.59 * MPG. Both the intercept and the MPG slope have p-values below 2e-16, so neither is likely to be a chance result. However, R2 (0.1623) and adjusted R2 (0.1602), which show how much of the variance in weight the equation explains, indicate a poor fit: MPG accounts for only about 16% of that variance.

Take alpha as 0.05 to check whether the slope is statistically significant.

summary_LR_Model_Fit <- summary(LR_Model_Fit)  # the model summary as an object
modelCoeffs <- summary_LR_Model_Fit$coefficients  # coefficient table of the model
beta_estimate <- modelCoeffs["Real.World.MPG", "Estimate"]  # slope estimate
standard_error <- modelCoeffs["Real.World.MPG", "Std. Error"]  # standard error of the slope
value_of_t <- beta_estimate/standard_error  # t statistic
value_of_t
## [1] -8.604374
qt(p = 0.025, df = 380)  # two-sided critical value at alpha = 0.05
## [1] -1.966226

Output : The t statistic (-8.60) lies beyond the critical value (-1.97), so we reject the null hypothesis of a zero slope at the 5% significance level.
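
The comparison of a t statistic with a critical value can be sketched end to end as follows. This is an illustration on the built-in `mtcars` data rather than on `rawdata`; it reads the residual degrees of freedom from the fitted model with `df.residual()` instead of typing them by hand, and it also computes the two-sided p-value.

```r
# mtcars is a built-in dataset used here only as a stand-in for rawdata.
fit   <- lm(wt ~ mpg, data = mtcars)
coefs <- summary(fit)$coefficients

t_stat <- coefs["mpg", "Estimate"] / coefs["mpg", "Std. Error"]
df_res <- df.residual(fit)                    # n minus number of coefficients

alpha   <- 0.05
t_crit  <- qt(1 - alpha / 2, df = df_res)     # two-sided critical value
p_value <- 2 * pt(-abs(t_stat), df = df_res)  # two-sided p-value

# Reject the null hypothesis of a zero slope when |t| exceeds the critical value.
reject_null <- abs(t_stat) > t_crit
c(t = t_stat, critical = t_crit, p = p_value)
reject_null
```

The decision based on `abs(t_stat) > t_crit` is equivalent to checking `p_value < alpha`.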

cbind(LR_Model_Fit$residuals, LR_Model_Fit$fitted.values) %>%
as.data.frame() %>%
ggplot(aes(y = LR_Model_Fit$residuals, x = LR_Model_Fit$fitted.values)) +
geom_point() + labs(y = "Residuals", x = "Fitted Values")  + theme(text = element_text(size = 16))+geom_hline(yintercept = 0)+ggtitle("Fitted Values vs Residuals")

Output : The chart shows that the residuals are widely scattered around zero across the whole range of fitted values.

We now check whether taking the square root of weight gives a better fit.

sqrtdata<- sqrt(rawdata$Weight..lbs.)
LR_Model_Fit_1 <- lm(sqrtdata~Real.World.MPG ,data=rawdata)
summary(LR_Model_Fit_1)
## 
## Call:
## lm(formula = sqrtdata ~ Real.World.MPG, data = rawdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.8039  -2.7355  -0.4291   2.8432  10.2715 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    71.09649    0.94819  74.982   <2e-16 ***
## Real.World.MPG -0.40517    0.04632  -8.746   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.966 on 382 degrees of freedom
## Multiple R-squared:  0.1669, Adjusted R-squared:  0.1647 
## F-statistic:  76.5 on 1 and 382 DF,  p-value: < 2.2e-16

Output: The fitted regression equation is sqrt(Weight) = 71.09649 - 0.40517 * MPG

Take alpha as 0.05

modelSummary <- summary(LR_Model_Fit_1)  # the model summary as an object
modelCoeffs <- modelSummary$coefficients  # coefficient table of the model
beta.estimate <- modelCoeffs["Real.World.MPG", "Estimate"]  # slope estimate
std.error <- modelCoeffs["Real.World.MPG", "Std. Error"]  # standard error of the slope
t_value <- beta.estimate/std.error  # t statistic
t_value
## [1] -8.746488
qt(p = .025, df = 380)
## [1] -1.966226

Output : We reject the null hypothesis at the 5% significance level

cbind(LR_Model_Fit_1$residuals,LR_Model_Fit_1$fitted.values) %>%
as.data.frame() %>%
ggplot(aes(y = LR_Model_Fit_1$residuals, x = LR_Model_Fit_1$fitted.values)) +
geom_point() + labs(y = "Residuals", x = "Fitted Values",title = "Residuals Vs Fitted Values") +
theme(text = element_text(size = 16))+geom_hline(yintercept = 0)

Output : The residuals are smaller in absolute size because the response is now on the square-root scale, but they are still widely scattered around zero across the range of fitted values.

Step : Next, check whether transforming weight with the base-10 logarithm of its cube root gives a better fit.

cuberoot<- log10(rawdata$Weight..lbs.^(1/3))
LR_Model_Fit_2<- lm(cuberoot~Real.World.MPG,data=rawdata)
summary(LR_Model_Fit_2)
## 
## Call:
## lm(formula = cuberoot ~ Real.World.MPG, data = rawdata)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.059297 -0.012435 -0.001512  0.013518  0.043848 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.2365187  0.0043514 284.166   <2e-16 ***
## Real.World.MPG -0.0018838  0.0002126  -8.861   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0182 on 382 degrees of freedom
## Multiple R-squared:  0.1705, Adjusted R-squared:  0.1683 
## F-statistic: 78.52 on 1 and 382 DF,  p-value: < 2.2e-16
summary_LR_Model_Fit <- summary(LR_Model_Fit_2)  # the model summary as an object
coefficients_LR_Model_Fit <- summary_LR_Model_Fit$coefficients  # coefficient table of this model
beta_estimate <- coefficients_LR_Model_Fit["Real.World.MPG", "Estimate"]  # slope estimate
standard.error <- coefficients_LR_Model_Fit["Real.World.MPG", "Std. Error"]  # standard error of the slope
value_of_t <- beta_estimate/standard.error  # t statistic
round(value_of_t, 3)
## [1] -8.861
qt(p = 0.025, df = 380)  # two-sided critical value at alpha = 0.05
## [1] -1.966226

Output: The t statistic lies beyond the critical value, so we reject the null hypothesis of a zero slope at the 5% significance level. The slope of the regression line is therefore statistically significant.

ggplot(rawdata, aes(x=Real.World.MPG,y=cuberoot)) + geom_point(color= "blue") + geom_smooth(method = "lm",color="yellow")+ggtitle("Real.World.MPG VS cuberoot")
## `geom_smooth()` using formula = 'y ~ x'

cbind(LR_Model_Fit_2$residuals, LR_Model_Fit_2$fitted.values) %>%
as.data.frame() %>%
ggplot(aes(y = LR_Model_Fit_2$residuals, x = LR_Model_Fit_2$fitted.values)) +
geom_point() + labs(y = "Residuals", x = "Fitted Values",title="Residuals vs Fitted Values") +
theme(text = element_text(size = 16))+geom_hline(yintercept = 0)

Output : Points are dispersed along the fitted values, and the residuals show little change.

Improve the regression model by using additional significant variables from the data set

LR_Model_Fit=lm(formula = Real.World.MPG ~  
                       poly(Real.World.CO2..g.mi., 4) + 
                       poly(Horsepower..HP.,4) + 
                       poly(Weight..lbs., 4) ,
                    data = rawdata)
summary(LR_Model_Fit)
## 
## Call:
## lm(formula = Real.World.MPG ~ poly(Real.World.CO2..g.mi., 4) + 
##     poly(Horsepower..HP., 4) + poly(Weight..lbs., 4), data = rawdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45704 -0.02449 -0.00620  0.01536  0.24209 
## 
## Coefficients:
##                                   Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                      19.996818   0.003260 6133.661  < 2e-16 ***
## poly(Real.World.CO2..g.mi., 4)1 -81.923344   0.170453 -480.622  < 2e-16 ***
## poly(Real.World.CO2..g.mi., 4)2  25.043496   0.078730  318.094  < 2e-16 ***
## poly(Real.World.CO2..g.mi., 4)3  -6.514450   0.068825  -94.652  < 2e-16 ***
## poly(Real.World.CO2..g.mi., 4)4   1.240789   0.068390   18.143  < 2e-16 ***
## poly(Horsepower..HP., 4)1        -1.230356   0.253400   -4.855 1.77e-06 ***
## poly(Horsepower..HP., 4)2        -0.172138   0.098917   -1.740 0.082649 .  
## poly(Horsepower..HP., 4)3        -0.004813   0.083443   -0.058 0.954035    
## poly(Horsepower..HP., 4)4         0.564276   0.073859    7.640 1.87e-13 ***
## poly(Weight..lbs., 4)1            0.925597   0.269905    3.429 0.000673 ***
## poly(Weight..lbs., 4)2            0.364124   0.102364    3.557 0.000423 ***
## poly(Weight..lbs., 4)3            0.390590   0.086846    4.498 9.21e-06 ***
## poly(Weight..lbs., 4)4           -0.284783   0.069246   -4.113 4.82e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06389 on 371 degrees of freedom
## Multiple R-squared:  0.9998, Adjusted R-squared:  0.9998 
## F-statistic: 1.496e+05 on 12 and 371 DF,  p-value: < 2.2e-16

Output : The summary above shows that some terms are not significant, namely the second- and third-degree horsepower terms (p = 0.083 and p = 0.954). We therefore simplify the model by reducing horsepower and weight to linear terms.

LR_Model_Fit1=lm(formula = Real.World.MPG ~  
                       poly(Real.World.CO2..g.mi.,4) + 
                       poly(Horsepower..HP.) + 
                       poly(Weight..lbs.) ,
                    data = rawdata)
summary(LR_Model_Fit1)
## 
## Call:
## lm(formula = Real.World.MPG ~ poly(Real.World.CO2..g.mi., 4) + 
##     poly(Horsepower..HP.) + poly(Weight..lbs.), data = rawdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58635 -0.02053 -0.00846  0.01247  0.26011 
## 
## Coefficients:
##                                   Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                      19.996818   0.003626 5514.140   <2e-16 ***
## poly(Real.World.CO2..g.mi., 4)1 -81.682386   0.178860 -456.683   <2e-16 ***
## poly(Real.World.CO2..g.mi., 4)2  24.885182   0.084085  295.952   <2e-16 ***
## poly(Real.World.CO2..g.mi., 4)3  -6.572361   0.073707  -89.169   <2e-16 ***
## poly(Real.World.CO2..g.mi., 4)4   1.381456   0.073231   18.864   <2e-16 ***
## poly(Horsepower..HP.)            -0.293999   0.252005   -1.167    0.244    
## poly(Weight..lbs.)               -0.063953   0.271040   -0.236    0.814    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07106 on 377 degrees of freedom
## Multiple R-squared:  0.9997, Adjusted R-squared:  0.9997 
## F-statistic: 2.419e+05 on 6 and 377 DF,  p-value: < 2.2e-16

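Output : A term is statistically significant when its p-value is below 0.05. In the reduced model all four CO2 terms are significant, but the linear horsepower (p = 0.244) and weight (p = 0.814) terms are not. The R-squared value stays very high (0.9997), which shows that CO2 emissions alone explain almost all of the variation in MPG.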
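
Whether dropping terms costs explanatory power can be checked formally with a nested-model F-test. The sketch below is an illustration on the built-in `mtcars` data, not a rerun of the models above; the formulas are assumptions chosen so that the reduced model is nested inside the full one.

```r
# mtcars is a built-in dataset used here only as a stand-in for rawdata.
full    <- lm(mpg ~ poly(wt, 2) + poly(hp, 2), data = mtcars)
reduced <- lm(mpg ~ poly(wt, 2) + hp, data = mtcars)  # drops the quadratic hp term

# F-test of the null hypothesis that the dropped term has a zero coefficient.
comparison <- anova(reduced, full)
comparison

# Adjusted R-squared penalizes extra terms, so it is the fairer comparison.
c(full = summary(full)$adj.r.squared, reduced = summary(reduced)$adj.r.squared)
```

A large p-value in the `anova()` table suggests the simpler model is adequate; a small one suggests the dropped term was contributing.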

PCA (PRINCIPAL COMPONENT ANALYSIS)

Select the columns to study

new_data <- rawdata [, c(5,8,11,12)]
Data_Scaled = scale(new_data)
head(Data_Scaled)
##      Real.World.MPG Real.World.CO2..g.mi. Weight..lbs. Horsepower..HP.
## [1,]      -1.585659              1.990871    0.1334871      -0.8204079
## [2,]      -1.495341              1.805174    0.1281990      -0.8408112
## [3,]      -1.494541              1.803580    0.1283282      -0.8402878
## [4,]      -1.911468              2.765575    0.1555482      -0.7352955
## [5,]      -1.847364              2.598289    0.0453413      -0.7558404
## [6,]      -2.032214              3.103616    0.3797665      -0.7148259

#Analyzing the data to determine its covariance and correlation

covariance_Matrix = cov(new_data)
covariance_Matrix
##                       Real.World.MPG Real.World.CO2..g.mi. Weight..lbs.
## Real.World.MPG              19.13986             -448.3012    -968.3456
## Real.World.CO2..g.mi.     -448.30121            11551.9612   21682.5655
## Weight..lbs.              -968.34562            21682.5655  301774.0027
## Horsepower..HP.             58.26358            -1579.3562   22821.8769
##                       Horsepower..HP.
## Real.World.MPG               58.26358
## Real.World.CO2..g.mi.     -1579.35616
## Weight..lbs.              22821.87693
## Horsepower..HP.            3111.97319

Correlation Matrix

correlationMatrix = cor(new_data)
correlationMatrix
##                       Real.World.MPG Real.World.CO2..g.mi. Weight..lbs.
## Real.World.MPG             1.0000000            -0.9533945   -0.4029211
## Real.World.CO2..g.mi.     -0.9533945             1.0000000    0.3672332
## Weight..lbs.              -0.4029211             0.3672332    1.0000000
## Horsepower..HP.            0.2387316            -0.2634112    0.7447192
##                       Horsepower..HP.
## Real.World.MPG              0.2387316
## Real.World.CO2..g.mi.      -0.2634112
## Weight..lbs.                0.7447192
## Horsepower..HP.             1.0000000

Transpose of covariance Matrix

transpose_of_covariance_Matrix <- t(covariance_Matrix) 
multiply = covariance_Matrix%*%transpose_of_covariance_Matrix
multiply
##                       Real.World.MPG Real.World.CO2..g.mi. Weight..lbs.
## Real.World.MPG               1142428             -26275575   -300630704
## Real.World.CO2..g.mi.      -26275575             606276793   6758100965
## Weight..lbs.              -300630704            6758100965  92059458115
## Horsepower..HP.            -21209007             471651146   6923769308
##                       Horsepower..HP.
## Real.World.MPG              -21209007
## Real.World.CO2..g.mi.       471651146
## Weight..lbs.               6923769308
## Horsepower..HP.             533020204

Output : A matrix is orthogonal if multiplying it by its transpose gives the identity matrix. The product above is not an identity matrix, so the covariance matrix is not orthogonal.
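
The orthogonality check can be written as a small test instead of being read off the printed product. The sketch below uses the built-in `mtcars` data as a stand-in; the helper name `is_orthogonal` is hypothetical. It also shows the contrast with the eigenvector matrix of a covariance matrix, which is orthogonal.

```r
# Hypothetical helper: TRUE when m %*% t(m) equals the identity within a tolerance.
is_orthogonal <- function(m, tol = 1e-8) {
  isTRUE(all.equal(m %*% t(m), diag(nrow(m)),
                   tolerance = tol, check.attributes = FALSE))
}

# mtcars is a built-in dataset used here only as a stand-in for new_data.
S <- cov(mtcars[, c("mpg", "wt", "hp")])
V <- eigen(S)$vectors  # eigenvectors of a symmetric matrix are orthonormal

c(covariance = is_orthogonal(S), eigenvectors = is_orthogonal(V))
```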

Calculating the eigenvalues and eigenvectors for the covariance and correlation matrices

eigenResidual1 = eigen(covariance_Matrix)
eigenResidual1$values
## [1] 3.050861e+05 1.105685e+04 3.126774e+02 1.419667e+00
eigenResidual1$vectors
##              [,1]        [,2]        [,3]         [,4]
## [1,]  0.003249883 -0.03610169  0.01691637  0.999199651
## [2,] -0.073064688  0.94587191  0.31486842  0.029081875
## [3,] -0.994514555 -0.04560166 -0.09408069  0.003179807
## [4,] -0.074778268 -0.31928589  0.94430956 -0.027279867
eigenResidual2 = eigen(correlationMatrix)
eigenResidual2$values
## [1] 2.20082486 1.71075588 0.05046492 0.03795434
eigenResidual2$vectors
##             [,1]       [,2]       [,3]       [,4]
## [1,]  0.65047130 -0.1631404  0.4367088  0.5996314
## [2,] -0.64405470  0.1873250  0.7024558  0.2380309
## [3,] -0.40203020 -0.6041091 -0.3977543  0.5614404
## [4,]  0.02126844 -0.7571966  0.3970300 -0.5182356

Output : The covariance matrix and the correlation matrix have different eigenvalues and eigenvectors, even though the two matrices are related to one another. For some variables, the eigenvector entries also differ in sign.

Spectral decomposition method used to find the square root of the covariance matrix

Squareroot <- eigenResidual1$vectors %*% diag(sqrt(eigenResidual1$values)) %*% t(eigenResidual1$vectors)
Squareroot
##           [,1]       [,2]       [,3]       [,4]
## [1,]  1.337533  -3.593018  -1.636459   1.327816
## [2,] -3.593018  98.779114  35.076372 -23.481616
## [3,] -1.636459  35.076372 546.678104  41.036858
## [4,]  1.327816 -23.481616  41.036858  29.577020

Output : By multiplying the matrix of eigenvectors by the diagonal matrix of the square roots of the eigenvalues, and then by the transpose of the eigenvector matrix, we obtain the square root of the covariance matrix through spectral decomposition.
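
A quick way to confirm that the result really is a square root is to multiply it by itself and compare with the original matrix. The sketch below does this on the built-in `mtcars` data as a stand-in for `new_data`; the variable names are assumptions for illustration.

```r
# mtcars is a built-in dataset used here only as a stand-in for new_data.
S <- cov(mtcars[, c("mpg", "wt", "hp")])
e <- eigen(S)

# Spectral decomposition: S = V diag(lambda) V', so S^(1/2) = V diag(sqrt(lambda)) V'.
S_half <- e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors)

# Largest absolute difference between S_half squared and S;
# it should be zero up to floating-point rounding.
max_error <- max(abs(S_half %*% S_half - S))
max_error
```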

Analyze the number of principal components necessary to capture at least 90% of the data variability.

Percent_Variance_Explained <-eigenResidual2$values / sum(eigenResidual2$values)
Percent_Variance_Explained 
## [1] 0.550206216 0.427688969 0.012616230 0.009488584

Need to check Cumulative percent variance

cumsum(Percent_Variance_Explained)
## [1] 0.5502062 0.9778952 0.9905114 1.0000000

Need to plot Percent_Variance_Explained

plot(Percent_Variance_Explained)

Need to plot Cumulative percent variance

plot(cumsum(Percent_Variance_Explained))

Output : The first two components together explain 97.8% of the variance, which already exceeds the 90% threshold, so we retain two principal components.
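
The number of components needed to reach a variance threshold can be computed rather than read from the plot. The sketch below uses `prcomp()` on the built-in `mtcars` data as a stand-in for `new_data`; the chosen columns and the variable names are assumptions. It also checks that the variances from `prcomp()` on scaled data match the eigenvalues of the correlation matrix.

```r
# mtcars is a built-in dataset used here only as a stand-in for new_data.
X <- mtcars[, c("mpg", "wt", "hp", "disp")]

# PCA on centered and scaled data, equivalent to working with the correlation matrix.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_explained <- cumsum(var_explained)

# Smallest number of components whose cumulative share reaches 90%.
n_components <- which(cum_explained >= 0.90)[1]
n_components

# The component variances equal the eigenvalues of the correlation matrix.
same_eigenvalues <- isTRUE(all.equal(pca$sdev^2, eigen(cor(X))$values))
same_eigenvalues
```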

Need to calculate the Principal component vectors

eigen_vectors2 = eigenResidual2$vectors[,1:2]
eigen_vectors2
##             [,1]       [,2]
## [1,]  0.65047130 -0.1631404
## [2,] -0.64405470  0.1873250
## [3,] -0.40203020 -0.6041091
## [4,]  0.02126844 -0.7571966
colnames(eigen_vectors2) = c("pc1", "pc2")
row.names(eigen_vectors2) = colnames(new_data)
eigen_vectors2
##                               pc1        pc2
## Real.World.MPG         0.65047130 -0.1631404
## Real.World.CO2..g.mi. -0.64405470  0.1873250
## Weight..lbs.          -0.40203020 -0.6041091
## Horsepower..HP.        0.02126844 -0.7571966

Output : MPG and horsepower load positively on the first principal component (PC1), while real-world CO2 and weight load negatively. MPG, weight, and horsepower load negatively on the second principal component (PC2), which is dominated by weight and horsepower. Taken together, the loadings suggest that emissions are lower when a vehicle is more fuel-efficient and has an appropriate weight and horsepower.

CONCLUSION

Future work and Limitations : It would be more useful to have production share and footprint as numerical variables, so that a more thorough study could determine the share of the various vehicle types across all variables.

Learnings: According to the data analysis, vehicle fuel efficiency is at an all-time high and emissions are at an all-time low. Even though SUVs are less efficient than sedans, there has been a noticeable trend in their favor. Since the performance of SUVs has improved significantly over time, and the comfort they offer is a reasonable trade-off against efficiency, the production share of SUVs has increased relative to that of sedans. The data also reveal how numerous factors, including emissions, weight, horsepower, footprint, and efficiency, relate to one another.

References :

See the links below for further reading that helps in understanding the methods and the data.

https://www.statology.org/

https://www.epa.gov/system/files/documents/2022-12/420s22001.pdf